Using Questionnaires to Gain Insight into Retest Effects 


AB-1 - Paper 

Using Questionnaires to Gain Insight into Retest Effects 

John Thain 

Defense Language Institute Foreign Language Center (DLIFLC) 

The Defense Language Institute Foreign Language Center (DLIFLC), located on the Presidio of Monterey, 
California, is the proponent for the Defense Language Aptitude Battery (DLAB); it is also the agency responsible 
for basic language training within the Department of Defense (DoD). DLAB is used to screen candidates for 
language training who have previously met Armed Services Vocational Aptitude Battery (ASVAB) requirements 
for language-related occupational specialties. Certain sections of DLAB are presented on audiotape, while others 
are presented in the test booklet. 

Recent studies have suggested that some parts of the DLAB could be reduced in scope to save administration 
time, while other parts of the DLAB could benefit from a corresponding lengthening of administration time. 
These studies have led DLIFLC to consider several options comprising relatively small-scale changes in the 
DLAB; these changes would retain most of the current test content, reduce the number of total test items, and 
retain current methods of test scoring and score interpretation. Part of data collection in support of this goal 
involved administration of a modified and reduced form of DLAB (DLAB 1.5) to a population that had 
previously taken the operational DLAB. DLAB 1.5 included all core items that would probably be included in a 
future version of DLAJB. In order to interpret DLAB 1.5 test data, questionnaires given to DLAB 1.5 examinees 
were extensively used to gain insight into retest effects. The objectives for DLAB 1.5 administration in order of 
importance were: (1) to determine if changing the time allotted (but not test content) for completing the printed 
part of the test would affect test statistics and predictive validity, (2) to determine if differences in audio quality 
in test administrations at MEPS would have any effect on retest scores for retained audio portions of the original 
test (in which no test content was changed), and (3) to field test revisions of four test items with faulty statistics 
in the printed portion of the test. 

This paper deals only with a comparison of a part of the original DLAB item response record with the item 
response record for DLAB 1.5, a reduced and modified version of the original test. Test scores in the data in this 
paper represent only the number correct in subpart totals and have no relation to the familiar DLAB standard 
scores used operationally. 

DLAB 1.5, the reduced and modified test, retained the following parts from the original DLAB: 

Audiotape paced parts. Part II (P2) involves identification of auditorially presented syllable stress patterns. It 
consists of 18 items requiring seven and one half minutes administration time; In P2, test administration is paced 
by audiotape. Part III (P3) involves deductive language learning of an artificial language; stimulus material is 
presented on audiotape. P3 consists of 44 items arranged into three subtests (P3S1, P3S2, and P3S3) and requires 
28 minutes administration time; like P2, it is also paced by audiotape. These two parts are identical to and require 
the same administration time as a corresponding block of 62 items in the original DLAB. 

Written part with time limit. Part IV (P4) involves inductive language learning based on pictures accompanied 
by printed artificial language work samples. It consists of 26 items carried over from the original DLAB plus the 
four replacement items referred to above. The examinee has 35 minutes to complete P4 in DLAB 1.5, rather than 
the 25 minutes allotted in the original DLAB for P4. 

In total, DLAB 1.5 consisted of 92 items and requires just over 70 minutes administration time, consisting of 88 
original items and 4 replacement items. The same number of items in original DLAB require slightly only 60 
minutes, because the P4 administration time in DLAB is 10 minutes less the P4 administration time in DLAB 
1.5. While increased administration time for P4 is one option for change in future DLAB revision, such an 
increase is not to be introduced operationally without reducing the administration time of the remaining other 
portions of the DLAB by a roughly equal or even greater amount. 

The idea for trying out an extended administration time for P4 was influenced by a DLI-funded psychometric 
study of DLAB conducted by O'Mara (1994). O'Mara drew a large sample from the "unrestricted" population 
(N=32,866) to analyze P4 examinee data collected during 1987-1988. The unrestricted population includes 
examinees with scores so low that they were not subsequently eligible for any language as well as examinees 
scoring high enough to be eligible for language training. Thus, in addition to "selected" examinees with complete 


DISTRIBUTION STATfcMfcN i A 
Approved for Public Release 

Distribution! ,n,;nr iited 


peQTTWH^OS 


19990422 023 




Using Questionnaires to Gain Insight into Retest Effects 


language training criterion data, this sample included examinees who had taken the DLAB, but for whom there 
was no criterion training data because they had not subsequently been assigned to language training. The data 
from this sample indicated that the series of items toward the end of P4 had progressively higher nonresponse 
rates as the end of P4 approached. The mean DLAB total test scores of the examinees not responding to each 
item were also increasing toward the end of P4. 

It was decided to administer DLAB 1.5 to a substantial sample of those students arriving at DLIFLC to attend 
language classes. These students had been selected to attend DLIFLC based on their initial DLAB scores at 
Military Enlistment Processing Stations (MEPS). 

DLIFLC was strategically placed to carry out such a study with a minimum of resources, Regulatory guidance 
insures that DLIFLC receives completed answer sheets from DLAB administrations at MEPS. DLIFLC was able 
to easily arrange DLAB 1.5 testing for incoming students upon arrival in Monterey. When all the original DLAB 
answer sheets from MEPS also arrived, it was easy to scan both sets of answer sheets and create a reduced item 
response record of original DLAB item response data record corresponding to just those 92 items retained or 
replaced one-for-one in the DLAB 1.5 test administration data record. Any number of "what if' comparisons 
could then be facilitated by cutting and pasting together smaller and more manageable databases composed of 
"virtual item response records" drawn from parts of the two sets of item response records. Subsequent criterion 
data and self-reported information from the examinees about both DLAB and DLAB 1.5 administrations could 
easily be added to such databases of interest. 

Part of the plan was to solicit the observations of the examinees taking the modified DLAB about the differences 
they perceived between the original DLAB administration at MEPS and the administration of DLAB 1.5 which 
they had just undergone. After the administration of DLAB 1.5 in Monterey, they responded to questionnaire 
items by marking multiple-choice alternatives on their test answer sheets. The numbering of the questionnaire 
items followed the last item number on the modified DLAB 1.5. Questionnaire response data were scanned along 
with the DLAB 1.5 test data. This made it possible to correlate the item responses on the DLAB and DLAB 1.5 
test administrations with the opinions of the examinees about the two test administrations. 

Another part of the original plan, as yet not carried out, was to subsequently gather the language training 
criterion data for these examinees who had been double-tested with both DLAB and DLAB 1.5. A variety of 
statistical analyses were to be carried out to suggest which version of DLAB maximized prediction benefits at 
which costs of test administration time tradeoffs. We hoped that criterion data would enable us to select an 
option with appropriate cutoff scores varying across foreign languages of varying difficulty. However, the most 
important criterion data were available only after the completion of the DLI basic course training, which lasted 
from 25 to 63 weeks dependent on the language. During the interim period, the DLI Research Division conducted 
an exploratory analysis of the questionnaire data and the data from the two test administrations. Partly due to 
changes in local research priorities, the criterion data, though now available in DLI school databases, has yet to 
be analyzed. In the absence of criterion data, the exploratory nature of the questionnaire analysis of retest data is 
even more pronounced and conclusions at this point are even more tentative than originally intended. 

During 1994, the DLAB 1.5 and the questionnaires were administered to 1058 students who had previously taken 
DLAB. DLAB 1.5 was administered in DLIFLC language laboratories through a central audio console to cohorts 
of thirty to sixty examinees wearing headphones. The original DLAB administration at MEPS took place under 
conditions varying by locality. 

Table I summarizes information about the first and second administrations. In columns under the label "first test 
administration," the raw mean score and maximum possible score are provided for the correspondingly reduced 
portion of the original DLAB item response set (labeled DLAB-R)—in other words, the original DLAB item 
response set reduced to just those items for which there is a counterpart in DLAB 1.5. Presented under the 
columns labeled "gain scores" are the gain scores of DLAB 1.5 over DLAB-R and the average p-value gain per 
item in each item set. 

TABLE I 

Raw score test gains in DLAB 1.5 Administration over Official DLAB administrations 
Mean Raw Score with Maximum Possible Score in Parentheses (N=1058) 
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From original 

administration 

From second 

administration 

Gain 

Scores 

Item set 

DLAB-R 

DLAB 1.5 

Gain 

Gain 

per item 

62 items from P2 and P3 

44.39(62) 

47.90(62) 

3.51(62) 

.06(62) 

Faulty original Part IV items 

Replacement Part IV Items 

2.20 (4) 

++ 

** 

2.58 (4) 

.38(4) 

.09(4) 

Part IV items not replaced 

16.34(26) 

18.42(26) 

2.08(26) 

.08(26) 






TOTAL SCORE 

62.92(92) 

68.90 (92) 

5.98(92) 

.07(92) 


++ Not present in original DLAB ** Replaced by replacement items in DLAB 1.5 

Table I indicates that the per item increase in mean raw score in the second administration for the initial set of 62 
items is much less than the increase per item in the common 26 items retained in P4. Above and beyond the bare 
data in Table I, further triangulation using questionnaire data and other information presented later in this paper 
seems to indicate that the gain scores in the first 62 items are largely due to conventional practice effects; 
however, differences in administration conditions between MEPS and DLIFLC related to the quality of audio 
subtests may also play a somewhat lesser role in these test gains. Questionnaire data presented later in this paper 
also suggest that a conventional practice effect seems to play as important a role as the increase in administration 
time in the gain for the last 26 common items; indeed there is surprisingly little strong evidence to suggest that 
the increase in administration time is the major factor responsible for the gain scores found in this last group of 
26 items. 

Had our ultimate goals not included introducing an operational change in administration time in P4, the task of 
merely determining the effect of the replacement of four items from the original test on the norming and score 
interpretation of a future operationally revised DLAB would have been much simpler. The means and standard 
deviations of the original DLAB subtests in the sample of 1058 students selected to enter DLIFLC closely tracks 
the typical means and standard deviations of other even larger samples drawn from the same population. The 
change resulting from introducing four replacement items into operational would have minimal effect, if any, on 
the standard score reporting system. 

If a gain score for the 26 common items from P4 could be somehow shown to result not from the change in 
administration time, but from conventional practice effect, such a gain score should not be reflected in new 
norms or standard scores for a revised DLAB with extended administration time for P4. One source of evidence 
as to the cause of P4 gains was to compare these gains to gains in the audio portion of the tests in which there 
was no extension of administration time; another way is was to consult examinee input about the relative 
significance of practice effects and extension of administration time. 

The audio portion of the DLAB 1.5 consisted of four groups of items P2, P3S1, P3S2, and P3S3. Table II 
indicates the relationship between examinees' self-reported perceptions about practice effect and their actual 
retest gains on these four audio subtests and the P4 common items. 1022 of the 1058 examinees responded to the 
question "Do you think having previously taken DLAB helped you taking this modified version?" 

TABLE II 

MEAN RAW SCORE GAINS FOR AUDIO TESTS AND P4 COMMON ITEMS 
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Examinees categorized in terms of own beliefs about the efficacy of practice effects 
(Mean increase in item p-values for each subtest in parentheses) 


How much 
help? 

N 

Pet 

P2 

P3S1 

P3S2 

P3S3 

P4 TOTAL 

Common 








Items 


Great deal 

209 

23.2% 

.83(.05) 

.59(.05) 

.45(.03) 

1.79(.ll) 

2.40(.09) 

6.06(.07) 

Somewhat 

556 

52.6% 

.60(.03) 

.45(.03) 

.21 (.02) 

1.73(.10) 

2.14(.08) 

5.13(.06) 

Maybe a 
little 

198 

18.7% 

.37(.02) 

.41(.03) 

-,26(-.02) 

1.84(.l 1) 

2.13(.08) 

4.49(.05) 

Not at all 

59 

5.6% 

-,15(-.01) 

.10(.01) 

.08(.01) 

.54(.03) 

1.14(.04) 

1.71 (.02) 

TOTAL 

1022 

100% 

.56(.03) 

.45(.03) 

.16(.01) 

1.69(.10) 

2.13(.08) 

4.99(.05) 


Because of the peculiarities of this data set, if N=1000, an average gain in p-values of .01 across 15 items (or a 
mean increase of .15 for a 15 item subtest) would probably be significant beyond the .05 level on a two-tailed 
paired samples T-test, whereas an average p-value gain of .03 per item might be required to be significant at the 
same level if the sample size were only 100. 

In the whole sample (N=1022), the 45 items in P2, P3S1, and P3S2 taken together display moderate (but highly 
statistically significant) average p-value gains of .025, while the 17 items in P3S3 displayed much greater 
average p-value gains of .10 per item (t=T6.21). Table II indicates that 94.5% of the examinees acknowledged 
some degree of practice effect and that the relatively few examinees (5.6%) not acknowledging a practice effect 
had much smaller test gains than the other examinees. It is not surprising from a psycholinguistic point of view 
that P3S3 and P4 would show larger gain scores in terms of conventional practice effects than P2, P3S1, and 
P3S2. The retest situation for these former subtests would be more likely to stimulate the examinee to reactivate 
cognitive and procedural "mental hooks than the latter subtests. While all subtests in DLAB are artificial in the 
sense that they are based on an artificial language, P3S3 and P4 tend to approach sentence level syntax and 
semantics rather than consisting of simple morphological or nonsense word content. 

TABLE III 

Where examinees thought previous experience helped 

and where experience actually did help the most 

(Mean increase in item p-values in each subtest in parentheses) 



N 

P2 

P3 

P4 

TOTAL 

P2 

143 

.49(.03) 

1.92(.04) 

1.40(.05) 

3.82(.04) 

P3 

260 

.79(.04) 

2.83(.06) 

1.69(.06) 

5.31(.05) 

P4 

211 

,73(.04) 

2.19(.05) 

2.95(. 11) 

5.87(.06) 

None 

283 

,41(.02) 

2.03(.05) 

2.20(.08) 

4.64(.05) 

TOTAL 

897 

,61(.03) 

2.28(.05) 

2.10(.08) 

4.99(,05) 


Examinees who thought that previous experience with the DLAB helped a great deal or helped somewhat were 
asked a further question: did their prior experience with DLAB help more with some parts of DLAB 1.5 than 
others? Table III shows that examinees who thought previous experience had helped them more on P3 or P4 did 
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in fact have much higher retest gains on corresponding test than persons who answered otherwise. However, 
examinees who said their previous experience had helped them most on P2 had lower retest gains across the 
board. 

Examinees were also queried concerning the administration of the audio portion of the DLAB at MEPS. While 
headphones were used in all the DLI DLAB 1.5 administration, only 14% of the examinees said that headphones 
were available for use in the official MEPS administration. In addition, while DLAB 1.5 was administered in 
cohorts of 30 or 60 examinees, 75% of the examinees reported taking MEPS DLAB alone or with one or two 
examinees, and only 6% reported MEPS administrations involving 10 or more examinees. 

Examinees were asked whether the audio portion was clearer during the official MEPS administration or during 
the DLAB 1.5 administration at DLIFLC . Table IV below indicates that all three groups responding to this 
question had overall test gains; those who believed DLI audio was better than MEPS audio had larger gains on 
audio tests than those who believed that there was no difference in audio between DLI and MEPS. Relatively few 
examinees thought that MEPS audio had been better than DLI audio; however this small subgroup had a 
consistent pattern of lower retest gains on audio tests. 

Although other questions were asked comparing test administration conditions at the MEPS and DLI, there was 
little evidence that any factors other than conventional practice effect and overall better audio at DLIFLC had 
any effect on audio retest scores. The data in Tables II, III, and IV taken together suggest that conventional 
practice effects were a more important factor in retest gains on audio tests than the improvement of audio quality 
at DLI over MEPS audio quality. 

TABLE IV 

How examinees compared MEPS audio and DLIFLC audio and how much examinees gained upon retest 
(Mean increase in item p-values in each subtest in parantheses) 



N 

Pet 

P2 

P31 

P32 

P33 

All 

audio 

P4 

common 








tests 

items 

DLI audio 
better 

467 

44.1% 

.66(.04) 

.55(.04) 

.37 (.03) 

1.82(.ll) 

3.40(.05) 

1.82(.ll) 

MEPS audio 
better 

45 

4.3% 

,29(.02) 

.09(.01) 

-,20(-.01) 

1.69(.10) 

1.87(.03) 

1.69(.10) 

No difference 

535 

51.6% 

.47(.03) 

.40(.03) 

.00(.00) 

1.55(.09) 

2.42(.04) 

1.55(.09) 

Total 

1047 

51.6% 

.55(.03) 

.45(.03) 

.16(.01) 

1.68(.10) 

2.83(.05) 

1.68(.10) 


The impetus to try out an increased administration time for P4 came from O'Mara's study of the DLAB item 
response data from the unrestricted (unselected) population. What was true of the unrestricted population did not 
hold up for the DLAB-R item and DLAB 1.5 response sets, which were derived exclusively from the tests of 
"selected" examinees with passing cutoff scores. In neither of these samples from the selected population did the 
mean total test scores of nonresponding examinees increase toward the end of P4. The number of omits was also 
minimal in both cases. 

Questionnaire data did not initially yield a clear picture of any group of examinees that benefited substantially 
enough from the expanded administration time in P4 to plausibly account for the total retest gains in P4. 
Examinees were asked to answer the following question with one of five options: Did you finish the last portion 
of the test before time was called? Table V indicates that all response groups answering this question showed 
retest gains on the first 10, and second 10, and final 6 common items in P4. Despite the fact that the number of 
omits was negligible (in both the official DLAB administration and the DLAB 1.5 administration), 16.5% of the 
DLAB 1.5 examinees responded that they "failed to finish" P4 on DLAB 1.5. Furthermore, examinees choosing 
"failed to finish" as a response did not score lower on the last six items in P4 in DLAB 1.5 than on earlier P4 
items. Nor was their mean score lower on the last six items than colleagues who responded differently on this 
questionnaire item. It is possible some respondents broadly interpreted "failure to finish" to mean "failure to be 
able to give fullest attention to all of the items regardless of test item sequence on the test under the conditions of 
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expanded administration time." In support of this interpretation, one notes that the first three response groups in 
Table IV (arguably the groups working till the last minute on DLAB 1.5) showed relatively higher retest gains on 
the last six items than the other two response groups. Nevertheless, overall comparison of Table IV data with 
other tables offers only weak support for assigning retest gains to the increased administration time rather a 
generalized conventional practice effect. 

TABLE V 

Self-report of adequacy of administration time for P4 common items in DLAB 1.5 and how much examinees 
gained upon retest 

(Mean increase in item p-values for each subtest in parantheses) 



N 

Pet 

P4- 

P4 

P4 

TOTAL P4 
Common Items 




(lrst 10) 

(2nd 10) 

(Last 6) 


Didn't finish 

175 

16.7 

.67(.07) 

.74(.07) 

.51(.08) 

1.92(.07) 

Finished, little or no 
time to check work 

91 

8.7 

.79(.08) 

.87(.08) 

.45(.08) 

2.11 (.08) 

Finished early, just able 
to check work 

174 

16.6 

1.12(.l 1) 

.71 (.07) 

.56(.09) 

2.39(.09) 

Finished early, checked 
work, still had time left 

294 

28.0 

.93(.09) 

77(.08) 

.20(.03) 

1.90(.07) 

Finished early, could 
have checked, but 

314 

30.0 

1.22(. 12). 

69(.07) 

.30(05) 

2.21(.09) 

didn't 







TOTAL 

1048 

100 

.99(10) 

.74(.07) 

.36(.06) 

2.09(.08) 


Official DLAB test scores at MEPS become part of the examinee's records. DLAB 1.5 administration data did 
not enter into examinees' official records. With this in mind, the questionnaire asked examinees whether they 
were motivated to do their best on DLAB 1.5. 75% responded that they were highly motivated and 24% 
responded unsteady motivation. Only 1% admitted not trying very hard. 

So far this paper has examined retest gains after categorizing examinees in terms of response to a single 
questionnaire item at a time. Our project also experimented with combining several questionnaire responses in 
order to create new coding variables. The scope of this paper allows us to do little more than briefly touch on 
such efforts. We attempted to code certain examinees' scores as more likely to reflect retest gains due to 
administration time and other examinees as reflecting a gain due to conventional practice effects. For example, 
examinees who did not finish the first DLAB administration, but who did finish the DLAB 1.5 administration 
with time to recheck responses, and also claimed that previous experience with DLAB didn't help them on the 
test would be high on a scale constructed to reflect "gain due to change in P4 administration time rather than 
conventional practice effect." Conversely examinees who finished the first DLAB administration before time 
was called, didn't check answers even though time was available on a subsequent DLAB 1.5 administration, 
confessed mediocre motivation to take DLAB 1.5, and claimed that previous DLAB experience helped them a 
great deal during P4 would be coded as low on the same scale (meaning that this kind of examinee's score could 
be assigned more to practice effects rather than administration time). Taken individually, all of these 
experimental recoded scales may seem to have a certain arbitrary ad hoc quality; a skeptic might even question 
whether test gain from changes in administration time could ever be cleanly separated from conventional practice 
effects in these circumstances. However, results from various scales seemed to converge on a conclusion. 
Conventional practice effects seemed to have more to do with bringing about P4 retest gains than did the change 
in administration time. 
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